Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge

Authors

  • Marius Pasca
  • Dekang Lin
  • Jeffrey Bigham
  • Andrei Lifchits
  • Alpa Jain
Abstract

Due to the inherent difficulty of processing noisy text, the potential of the Web as a decentralized repository of human knowledge remains largely untapped during Web search. Access to billions of binary relations among named entities would enable new search paradigms and alternative methods for presenting search results. A first concrete step towards building large searchable repositories of factual knowledge is to derive such knowledge automatically at large scale from textual documents. Generalized contextual extraction patterns allow for fast iterative progression towards extracting one million facts of a given type (e.g., Person-BornIn-Year) from 100 million Web documents of arbitrary quality. The extraction starts from as few as 10 seed facts, requires no additional input knowledge or annotated text, and emphasizes scale and coverage by avoiding the use of syntactic parsers, named entity recognizers, gazetteers, and similar text processing tools and resources.
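The iterative, seed-driven extraction described in the abstract can be illustrated with a small sketch. This is not the authors' method: it is a simplified bootstrapping loop that learns literal infix strings between a name and a year, whereas the paper uses generalized patterns. The function names (`learn_patterns`, `apply_patterns`, `bootstrap`), the toy corpus `SENTENCES`, and the capitalized-word regular expression standing in for name detection are all assumptions made for illustration.

```python
import re

# Toy corpus (hypothetical); a real run would cover millions of Web documents.
SENTENCES = [
    "Wolfgang Mozart was born in 1756.",
    "Albert Einstein was born in 1879.",
    "Albert Einstein (born 1879) developed relativity.",
    "Marie Curie (born 1867) was a physicist.",
]

def learn_patterns(sentences, facts):
    """Collect the literal text found between a known name and its year."""
    patterns = set()
    for sent in sentences:
        for name, year in facts:
            i = sent.find(name)
            j = sent.find(year)
            if i != -1 and j != -1 and i + len(name) < j:
                infix = sent[i + len(name):j]
                if infix.strip():  # skip whitespace-only contexts
                    patterns.add(infix)
    return patterns

def apply_patterns(sentences, patterns):
    """Match each infix against all sentences to propose new (name, year) facts."""
    # Assumption: a name is a run of capitalized words, a year is four digits.
    name_re = r"((?:[A-Z][a-z]+ )*[A-Z][a-z]+)"
    facts = set()
    for infix in patterns:
        regex = re.compile(name_re + re.escape(infix) + r"(\d{4})")
        for sent in sentences:
            for m in regex.finditer(sent):
                facts.add((m.group(1), m.group(2)))
    return facts

def bootstrap(sentences, seeds, iterations=2):
    """Alternate between learning patterns from facts and facts from patterns."""
    facts = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(sentences, facts)
        facts |= apply_patterns(sentences, patterns)
    return facts

facts = bootstrap(SENTENCES, {("Wolfgang Mozart", "1756")}, iterations=2)
```

Starting from the single seed, the first iteration learns the infix " was born in " and picks up the Einstein fact; the second iteration learns " (born " from the newly found fact and reaches the Curie sentence. A realistic system would also need to rank or filter candidate facts, which this sketch omits.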


Related resources

Open Information Extraction for the Web

Figure 4.2: Open Extraction from Wikipedia: TextRunner extracts 32.5 million distinct assertions from 2.5 million Wikipedia articles. 6.1 million of these tuples represent concrete relationships between named entities. The ability to automatically detect synonymous facts about abstract entities...


Names and Similarities on the Web: Fact Extraction in the Fast Lane

In a new approach to large-scale extraction of facts from unstructured text, distributional similarities become an integral part of both the iterative acquisition of high-coverage contextual extraction patterns, and the validation and ranking of candidate facts. The evaluation measures the quality and coverage of facts extracted from one hundred million Web documents, starting from ten seed fac...


Application of Semantic Web Ontologies in Medical Information Systems

One of the challenges of current medical information systems, which are based on keyword searching, is that they may retrieve a large amount of irrelevant information. These systems also do not provide interoperability among healthcare systems. To address these challenges, and to improve interoperability between user and machine, semantic web (web 3) has been d...


Searching for Knowledge Instead of Web Sites

Today’s Web search engines can find Web pages that contain certain keywords. Up to now, however, any advanced information demands that concern facts from multiple Web pages, let alone a logical connection between them, are inherently beyond the answering capabilities of search engines. This is why our approach is to collect information from Web sites and to organize it in a huge knowledge struc...


A Large Scale System for Searching and Browsing Images from the World Wide Web

This paper outlines the technical details of a prototype system for searching and browsing over a million images from the World Wide Web using their visual contents. The system relies on two modalities for accessing images — automated image annotation and NN image network browsing. The user supplies the initial query in the form of one or more keywords and is then able to locate the desired ima...



Publication date: 2006